Here is the data set.

The two-way ANOVA will test whether the independent variables (type of training [Yoga vs. Cardio vs. HIIT] AND Gender of training [F vs. M]) have an effect on the dependent variable (well-being - score). But there are some other possible sources of variation in the data that we want to take into account.

We are going to ask if there is an effect of type of training on participants’ well being scores

##    Happy_Score Sport_type Gender
## 1           20       yoga      m
## 2           75       yoga      w
## 3           70       yoga      w
## 4           55       yoga      m
## 5           65       yoga      w
## 6           35       yoga      m
## 7           25       yoga      w
## 8           65       yoga      w
## 9           50       yoga      m
## 10          55       yoga      w
## 11          75     cardio      m
## 12          40     cardio      w
## 13          55     cardio      m
## 14          77     cardio      w
## 15          80     cardio      m
## 16          70     cardio      m
## 17          35     cardio      w
## 18          50     cardio      m
## 19          60     cardio      w
## 20          80     cardio      m
## 21          25       HIIT      w
## 22          35       HIIT      w
## 23          45       HIIT      m
## 24          76       HIIT      m
## 25          30       HIIT      m
## 26          30       HIIT      w
## 27          40       HIIT      w
## 28          65       HIIT      m
## 29          55       HIIT      m
## 30          15       HIIT      m

I would always advice you to plot your data.

## Bin width defaults to 1/30 of the range of the data. Pick better
## value with `binwidth`.

And here it is for the flipped plot:

## Bin width defaults to 1/30 of the range of the data. Pick better
## value with `binwidth`.

mu_11_w = mean(dat$Happy_Score[dat$Sport_type == "yoga" & dat$Gender == "w"])
mu_11_m = mean(dat$Happy_Score[dat$Sport_type == "yoga" & dat$Gender == "m"])

mu_12_w = mean(dat$Happy_Score[dat$Sport_type == "cardio" & dat$Gender == "w"])
mu_12_m = mean(dat$Happy_Score[dat$Sport_type == "cardio" & dat$Gender == "m"])

mu_13_w = mean(dat$Happy_Score[dat$Sport_type == "HIIT" & dat$Gender == "w"])
mu_13_m = mean(dat$Happy_Score[dat$Sport_type == "HIIT" & dat$Gender == "m"])

The statistical model’s equation/notation is:

\(\color{red}{Y_{ijk} = \mu_{jk} + \alpha_{j} + \beta_{k} + \gamma_{jk} + \epsilon_{ijk}}\)

Looks scary, right?

Lets break it down:

Lets define the number of levels in every group:

k = 3 
j = 2

\(\alpha_{j}\) and \(\beta_{k}\) are based on the means for each population and each level:

Lets start with computing the means across all levels of all populations

Happy_Score Sport_type Gender
20 yoga m
75 yoga w
70 yoga w
55 yoga m
65 yoga w
35 yoga m
25 yoga w
65 yoga w
50 yoga m
55 yoga w
75 cardio m
40 cardio w
55 cardio m
77 cardio w
80 cardio m
70 cardio m
35 cardio w
50 cardio m
60 cardio w
80 cardio m
25 HIIT w
35 HIIT w
45 HIIT m
76 HIIT m
30 HIIT m
30 HIIT w
40 HIIT w
65 HIIT m
55 HIIT m
15 HIIT m
Happy_Score Sport_type Gender
20 yoga m
75 yoga w
70 yoga w
55 yoga m
65 yoga w
35 yoga m
25 yoga w
65 yoga w
50 yoga m
55 yoga w
75 cardio m
40 cardio w
55 cardio m
77 cardio w
80 cardio m
70 cardio m
35 cardio w
50 cardio m
60 cardio w
80 cardio m
25 HIIT w
35 HIIT w
45 HIIT m
76 HIIT m
30 HIIT m
30 HIIT w
40 HIIT w
65 HIIT m
55 HIIT m
15 HIIT m
Happy_Score Sport_type Gender
20 yoga m
75 yoga w
70 yoga w
55 yoga m
65 yoga w
35 yoga m
25 yoga w
65 yoga w
50 yoga m
55 yoga w
75 cardio m
40 cardio w
55 cardio m
77 cardio w
80 cardio m
70 cardio m
35 cardio w
50 cardio m
60 cardio w
80 cardio m
25 HIIT w
35 HIIT w
45 HIIT m
76 HIIT m
30 HIIT m
30 HIIT w
40 HIIT w
65 HIIT m
55 HIIT m
15 HIIT m
Happy_Score Sport_type Gender
20 yoga m
75 yoga w
70 yoga w
55 yoga m
65 yoga w
35 yoga m
25 yoga w
65 yoga w
50 yoga m
55 yoga w
75 cardio m
40 cardio w
55 cardio m
77 cardio w
80 cardio m
70 cardio m
35 cardio w
50 cardio m
60 cardio w
80 cardio m
25 HIIT w
35 HIIT w
45 HIIT m
76 HIIT m
30 HIIT m
30 HIIT w
40 HIIT w
65 HIIT m
55 HIIT m
15 HIIT m
Happy_Score Sport_type Gender
20 yoga m
75 yoga w
70 yoga w
55 yoga m
65 yoga w
35 yoga m
25 yoga w
65 yoga w
50 yoga m
55 yoga w
75 cardio m
40 cardio w
55 cardio m
77 cardio w
80 cardio m
70 cardio m
35 cardio w
50 cardio m
60 cardio w
80 cardio m
25 HIIT w
35 HIIT w
45 HIIT m
76 HIIT m
30 HIIT m
30 HIIT w
40 HIIT w
65 HIIT m
55 HIIT m
15 HIIT m

We will start with stating our hypotheses:

For factor 1:

\(H_{0}:\alpha_{j} = 0\) for all \(j\)

\(H_{1}:\alpha_{j} ≠ 0\) for at least one \(j\)

For factor 2:

\(H_{0}:\beta_{k} = 0\) for all \(j\)

\(H_{1}:\beta_{k} ≠ 0\) for at least one \(j\)

For an interaction:

\(H_{0}:\gamma_{jk} = 0\) for all combinations of \(jk\)

\(H_{1}:\gamma_{jk} ≠ 0\) for at least one combinations of \(jk\)


We continue with running the statistical model.

\[\color{red}{Y_{ijk} = \mu_{jk} + \alpha_{j} + \beta_{k} + \gamma_{jk} + \epsilon_{ijk}}\]

##                   Df Sum Sq Mean Sq F value Pr(>F)  
## Sport_type         2   2123  1061.4   3.644 0.0415 *
## Gender             1    103   102.8   0.353 0.5581  
## Sport_type:Gender  2   1895   947.6   3.253 0.0562 .
## Residuals         24   6990   291.3                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We finish with measuring the Effect sizes:

\(\color{red}{\eta^2}\).

The effect size used for ANOVA. It easures the proportion of the total variance in a dependent variable that is associated with the membership of different groups defined by an independent variables.

Partial \(\eta^2\) is a similar measure in which the effects of other independent variables and interactions are partialled out.

Remember the calculation of \(\eta^2\) for the one-factor anova?

If not… here it is: \(\eta^2= \frac{\sigma^2_{zw}}{\sigma^2_{tot}}\)

In the two-factor anova we extend the above-written formula to a similar one. However, this time we have to consider the new parameters: - \(\sigma^2_{factor1}\) - \(\sigma^2_{factor2}\) - \(\sigma^2_{interaction}\) - \(\sigma^2_{DV}\)

–>

The overall effectsize \(\eta^2_{tot} = \sigma^2_{factor1} + \sigma^2_{factor2} + \sigma^2_{interaction} + \sigma^2_{DV}\)

Partial \(\color{red}{\eta^2}\).

It measures the proportion of variance explained by a given variable of the total variance remaining after accounting for variance explained by other variables in the model.

library(DescTools)
EtaSq(mod_anova) 
##                        eta.sq eta.sq.part
## Sport_type        0.191030749  0.23291852
## Gender            0.009247787  0.01448637
## Sport_type:Gender 0.170568077  0.21329045

Example of an interpretation of the results: + \(\eta^2\) is the proportion of variance the treatment accounts for in the wellbeing of participants. In the case of the sport type, we see that 19.1% of the total variance of wellbeing is explained by sport participants did.
+ \(\eta_{part}^2\) is the proportion of variance the treatment accounts for in the wellbeing of participants. In the case of the sport type, we see that 23.3% of the variance is explained by the sport type once the gender and the interaction effect are “taken out”.


F-distribution & Hypothesis testing

For two-way ANOVA, the ratio between the mean sum of squares of a specific factor and the mean of sum of squares of the residuals (i.e., the variability within) is testd.

Let’s look at how the distribution and the critical values look like.

We use the pf() to calculate the area under the curve for the interval [0,4.226] and the interval [4.226,+∞) of a F curve with with \(v1=1\) and \(v2=24\)

x = sum_anova[[1]][["F value"]][1]
df_factor1 = 1
df_inn = 24
# interval $[0,1.5]
pf(x, df1 = df_factor1, df2 = df_inn, lower.tail = TRUE)
## [1] 0.9317056
x = sum_anova[[1]][["F value"]][1]
df_factor1 = 1
df_inn = 24
pf(x, df1 = df_factor1, df2 = df_inn, lower.tail = FALSE)
## [1] 0.06829437
library(latex2exp)
x <- rf(100000, df1 = 2, df2 = 24)
hist(x, 
     breaks = 'Scott', 
     freq = FALSE, 
     xlim = c(0,5), 
     ylim = c(0,1.5),
     xlab = '', 
     main = (TeX('Histogram for a $\\F$-distribution with $\\v_1 = 2$ and $\\v_2 = 24$ degrees of freedom (df)')), cex.main=0.9)

curve(df(x, df1 = 1, df2 = 24), from = 0, to = 5, n = 5000, col= 'red', lwd=3, add = T)
abline(v=4.227, col = "pink", lwd=2,  lty='dashed')
abline(v=.321, col = "blue", lwd=2,  lty='dashed')
abline(v=3.46, col = "green", lwd=2,  lty='dashed')

Hope that was helpful. Let me know what you think and if you want me to add anything to make your life easier!